The aim of this study is to conduct a market basket analysis using association rules, an unsupervised machine learning technique, mined with the Apriori algorithm. Such an analysis is useful for learning about consumer preferences: knowing that a consumer has already reached for a particular product or set of products, we can estimate which other products he or she is more likely to reach for. With this knowledge, a company can plan its sales and discount policies for specific products more effectively.
The data set used in this analysis can be found on Kaggle. It contains basket data for 7501 transactions of 119 unique products.
library(knitr)
library(arules)
library(arulesViz)
baskets <- read.transactions("Data/Market_Basket_Optimisation.csv", sep = ",")
summary(baskets)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
Our data contains 7501 rows, each representing a single transaction, and 119 columns corresponding to unique products. The largest transaction contains 20 products, and the most frequently purchased product is mineral water. The median basket contains 3 products and the mean is about 3.9; three quarters of transactions contain 5 products or fewer.
head(sort(itemFrequency(baskets, type = "absolute"), decreasing = TRUE), 20)
## mineral water eggs spaghetti french fries
## 1788 1348 1306 1282
## chocolate green tea milk ground beef
## 1229 991 972 737
## frozen vegetables pancakes burgers cake
## 715 713 654 608
## cookies escalope low fat yogurt shrimp
## 603 595 574 536
## tomatoes olive oil frozen smoothie turkey
## 513 494 475 469
head(sort(itemFrequency(baskets, type = "absolute"), decreasing = FALSE), 20)
## water spray napkins cream bramble
## 3 5 7 14
## tea chutney mashed potato chocolate bread
## 29 31 31 32
## dessert wine ketchup oatmeal babies food
## 33 33 33 34
## sandwich asparagus cauliflower corn
## 34 36 36 36
## salad shampoo hand protein bar mint green tea
## 37 37 39 42
itemFrequencyPlot(baskets, topN = 15, main = "Support of 15 most frequent products")
The plot above shows the 15 most frequently purchased products on a relative scale, known as “support”. Support is the share of transactions in the data set that contain a given item, itemset or rule. Only 7 products have support above 10%: mineral water, eggs, spaghetti, french fries, chocolate, green tea and milk.
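As a quick check of these figures, relative support can be recomputed from the absolute counts. The sketch below uses base R only; the counts are copied from the `itemFrequency()` listing above, and the variable names are illustrative.

```r
# Absolute counts copied from the itemFrequency() listing above
# (the eight most frequent items, in descending order)
n_transactions <- 7501
counts <- c("mineral water" = 1788, eggs = 1348, spaghetti = 1306,
            "french fries" = 1282, chocolate = 1229, "green tea" = 991,
            milk = 972, "ground beef" = 737)

# Relative support = share of transactions that contain the item
support <- counts / n_transactions
print(round(support, 4))

# Number of listed items whose support exceeds 10%
n_above_10pct <- sum(support > 0.10)
print(n_above_10pct)
```

Because the listing is sorted in descending order, every item after ground beef has lower support still.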
Let’s run the Apriori algorithm on our data, setting the minimum support to 0.01, the minimum confidence to 0.4 and the minimum rule length to 2 items, and inspect the results sorted in turn by support, confidence and lift.
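The call that creates `rules1` is not shown in the text. The following is an assumed reconstruction, with thresholds read off the "Parameter specification" block of the output (support 0.01, confidence 0.4, minlen 2); the original call may have differed in other arguments. The guard only lets the snippet run on its own when `arules` or `baskets` is unavailable.

```r
# Thresholds as printed in the "Parameter specification" output (assumed
# to be the values passed by the author)
params <- list(support = 0.01, confidence = 0.4, minlen = 2)

# A support of 0.01 over 7501 transactions corresponds to about 75 baskets,
# in line with the "Absolute minimum support count: 75" reported in the output
min_count <- floor(params$support * 7501)
print(min_count)

# Assumed call: needs the arules package and the `baskets` object loaded earlier
if (requireNamespace("arules", quietly = TRUE) && exists("baskets")) {
  rules1 <- arules::apriori(baskets, parameter = params)
}
```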
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 75
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
r1 <- inspect(sort(rules1 , by = "support")[1:5])
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {ground beef} | => | {mineral water} | 0.0409279 | 0.4165536 | 0.0982536 | 1.747521 | 307 |
| [2] | {olive oil} | => | {mineral water} | 0.0275963 | 0.4190283 | 0.0658579 | 1.757904 | 207 |
| [3] | {soup} | => | {mineral water} | 0.0230636 | 0.4564644 | 0.0505266 | 1.914955 | 173 |
| [4] | {salmon} | => | {mineral water} | 0.0170644 | 0.4012539 | 0.0425277 | 1.683337 | 128 |
| [5] | {ground beef, spaghetti} | => | {mineral water} | 0.0170644 | 0.4353741 | 0.0391948 | 1.826477 | 128 |
The rule with the highest support in our dataset is {ground beef} => {mineral water}: about 4.1% of all transactions (307 baskets) contain both items, and about 41.7% of baskets with ground beef also contain mineral water.
r2 <- inspect(sort(rules1 , by = "confidence")[1:5])
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {eggs, ground beef} | => | {mineral water} | 0.0101320 | 0.5066667 | 0.0199973 | 2.125563 | 76 |
| [2] | {ground beef, milk} | => | {mineral water} | 0.0110652 | 0.5030303 | 0.0219971 | 2.110308 | 83 |
| [3] | {chocolate, ground beef} | => | {mineral water} | 0.0109319 | 0.4739884 | 0.0230636 | 1.988472 | 82 |
| [4] | {frozen vegetables, milk} | => | {mineral water} | 0.0110652 | 0.4689266 | 0.0235969 | 1.967236 | 83 |
| [5] | {soup} | => | {mineral water} | 0.0230636 | 0.4564644 | 0.0505266 | 1.914955 | 173 |
When analysing association rules, confidence is an important measure. For a rule X => Y, it is the share of transactions containing the item or itemset X that also contain the item or itemset Y. In our data, the rule with the highest confidence is {eggs, ground beef} => {mineral water}: about 50.7% of baskets with eggs and ground beef also contain mineral water.
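To make the definition concrete, the confidence of the top rule can be recomputed by hand. This is a minimal base R sketch; the count and coverage are copied from the first row of the table above, and the variable names are illustrative.

```r
n_transactions <- 7501

# Rule {eggs, ground beef} => {mineral water}, values from the table above
count_rule <- 76                                  # baskets containing all three items
count_lhs  <- round(0.0199973 * n_transactions)   # coverage expressed as a count

# Support of the rule: share of all baskets that contain every item in it
support_rule <- count_rule / n_transactions

# Confidence: share of baskets with the left-hand side that also contain
# the right-hand side
confidence <- count_rule / count_lhs

print(c(support = support_rule, confidence = confidence))
```

Confidence is therefore the rule's support divided by its coverage (the support of the left-hand side).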
r3 <- inspect(sort(rules1 , by = "lift")[1:5])
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {ground beef, mineral water} | => | {spaghetti} | 0.0170644 | 0.4169381 | 0.0409279 | 2.394681 | 128 |
| [2] | {eggs, ground beef} | => | {mineral water} | 0.0101320 | 0.5066667 | 0.0199973 | 2.125563 | 76 |
| [3] | {ground beef, milk} | => | {mineral water} | 0.0110652 | 0.5030303 | 0.0219971 | 2.110308 | 83 |
| [4] | {chocolate, ground beef} | => | {mineral water} | 0.0109319 | 0.4739884 | 0.0230636 | 1.988472 | 82 |
| [5] | {frozen vegetables, milk} | => | {mineral water} | 0.0110652 | 0.4689266 | 0.0235969 | 1.967236 | 83 |
Lift is also an important measure in the study of association rules. It assesses the strength of the relationship between the two sides of a rule X => Y and is defined as the ratio of the observed support of the combined itemset (X and Y together) to the support expected if X and Y were independent. Equivalently, it is the confidence of the rule divided by the support of Y. A lift greater than 1 indicates a positive association: the presence of X makes Y more likely to be present. A lift less than 1 indicates a negative association, and a lift of 1 indicates independence. In our results, the rule with the highest lift is {ground beef, mineral water} => {spaghetti}: with a lift of about 2.39, baskets containing ground beef and mineral water are roughly 2.4 times as likely to contain spaghetti as a basket chosen at random.
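The lift of the top rule can be checked in the same way. The sketch below is base R only; the counts come from the tables and the item frequency listing above (307 baskets with ground beef and mineral water, 1306 with spaghetti), and the variable names are illustrative.

```r
n_transactions <- 7501

# Rule {ground beef, mineral water} => {spaghetti}
count_rule <- 128    # baskets with ground beef, mineral water and spaghetti
count_lhs  <- 307    # baskets with ground beef and mineral water
count_rhs  <- 1306   # baskets with spaghetti

confidence  <- count_rule / count_lhs       # P(spaghetti | ground beef, mineral water)
support_rhs <- count_rhs / n_transactions   # P(spaghetti) over all baskets

# Lift: how much more often the right-hand side appears given the
# left-hand side than it does in baskets overall
lift <- confidence / support_rhs
print(lift)
```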
plot(rules1, engine = "visNetwork", method="graph", limit = 10)
plot(rules1, engine = "default", method="paracoord", limit = 10, main = "Parallel coordinates plot for 10 strongest rules")
We can also check what drives consumers to buy a particular product. Let’s check for french fries.
rules.frenchFries.rhs <- apriori(data = baskets,
                                 parameter = list(supp = 0.01, conf = 0.04),
                                 appearance = list(default = "lhs", rhs = "french fries"),
                                 control = list(verbose = FALSE))
rules.frenchFries.rhs.bylift<-sort(rules.frenchFries.rhs, by="lift", decreasing=TRUE)
f1 <- inspect(head(rules.frenchFries.rhs.bylift))
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {burgers} | => | {french fries} | 0.0219971 | 0.2522936 | 0.0871884 | 1.476173 | 165 |
| [2] | {frozen smoothie} | => | {french fries} | 0.0145314 | 0.2294737 | 0.0633249 | 1.342654 | 109 |
| [3] | {cake} | => | {french fries} | 0.0178643 | 0.2203947 | 0.0810559 | 1.289533 | 134 |
| [4] | {green tea} | => | {french fries} | 0.0285295 | 0.2159435 | 0.1321157 | 1.263488 | 214 |
| [5] | {pancakes} | => | {french fries} | 0.0201306 | 0.2117812 | 0.0950540 | 1.239135 | 151 |
| [6] | {chocolate} | => | {french fries} | 0.0343954 | 0.2099268 | 0.1638448 | 1.228285 | 258 |
The results show that baskets containing burgers, frozen smoothie or cake, among others, are more likely than average to also contain french fries. The lift values, between about 1.2 and 1.5, indicate a modest positive association.
plot(rules.frenchFries.rhs, engine = "htmlwidget", method="grouped")
And the opposite situation: what else is a consumer likely to buy when french fries are already in the basket?
rules.frenchFries.lhs <- apriori(data = baskets,
                                 parameter = list(supp = 0.01, conf = 0.04),
                                 appearance = list(default = "rhs", lhs = "french fries"),
                                 control = list(verbose = FALSE))
rules.frenchFries.lhs.bylift<-sort(rules.frenchFries.lhs, by="lift", decreasing=TRUE)
f2 <- inspect(head(rules.frenchFries.lhs.bylift))
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {french fries} | => | {burgers} | 0.0219971 | 0.1287051 | 0.1709105 | 1.476173 | 165 |
| [2] | {french fries} | => | {frozen smoothie} | 0.0145314 | 0.0850234 | 0.1709105 | 1.342654 | 109 |
| [3] | {french fries} | => | {cake} | 0.0178643 | 0.1045242 | 0.1709105 | 1.289533 | 134 |
| [4] | {french fries} | => | {green tea} | 0.0285295 | 0.1669267 | 0.1709105 | 1.263488 | 214 |
| [5] | {french fries} | => | {pancakes} | 0.0201306 | 0.1177847 | 0.1709105 | 1.239135 | 151 |
| [6] | {french fries} | => | {chocolate} | 0.0343954 | 0.2012480 | 0.1709105 | 1.228285 | 258 |
The results show that baskets containing french fries are more likely than average to also contain products such as burgers, frozen smoothie or cake.
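The two french fries tables list the same item pairs with identical lift but different confidence. A short base R sketch shows why, using the burgers pair; the counts are taken from the tables and the item frequency listing above, and the variable names are illustrative.

```r
n_transactions <- 7501
count_both    <- 165    # baskets with both burgers and french fries
count_fries   <- 1282   # baskets with french fries
count_burgers <- 654    # baskets with burgers

# Confidence depends on the direction of the rule
conf_burgers_to_fries <- count_both / count_burgers
conf_fries_to_burgers <- count_both / count_fries

# Lift divides by the support of the other side, so both directions agree
lift_burgers_to_fries <- conf_burgers_to_fries / (count_fries / n_transactions)
lift_fries_to_burgers <- conf_fries_to_burgers / (count_burgers / n_transactions)

print(c(conf_burgers_to_fries, conf_fries_to_burgers))
print(c(lift_burgers_to_fries, lift_fries_to_burgers))
```

Lift is symmetric by construction, so it describes the strength of the association but not its direction; confidence is the measure that distinguishes the two rules.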
plot(rules.frenchFries.lhs, engine = "htmlwidget", method="grouped")
In this article, a market basket analysis was carried out using association rules, an unsupervised machine learning technique, mined with the Apriori algorithm. Mineral water proved to be the most frequently purchased item and appears in most of the rules found. The strongest rule by lift was {ground beef, mineral water} => {spaghetti}: baskets containing ground beef and mineral water are about 2.4 times as likely to contain spaghetti as an average basket. The association rules between french fries and other products were also examined in detail. French fries are modestly associated with burgers, frozen smoothie and cake, among others (lift between about 1.3 and 1.5). Because lift is symmetric, the association holds in both directions, although the confidence of the two rules differs. The analysis provides insight into the association rules behind consumers’ choices of specific products and may prove useful in planning sales and discount policies for those products.